Gene-gene interactions may contribute to the genetic variation underlying complex traits
but have not always been taken fully into account. Statistical analyses that consider 
gene-gene interaction may increase the power of detecting associations, especially for 
low-marginal-effect markers, and may explain in part the "missing heritability." Detecting 
pair-wise and higher-order interactions genome-wide requires enormous computational 
power. Filtering pipelines increase the computational speed by limiting the number of 
tests performed. We summarize existing filtering approaches to detect epistasis, after
distinguishing the purposes that lead us to search for epistasis. Statistical filtering includes
quality control on the basis of single marker statistics to avoid the analysis of bad and 
least informative data, and limits the search space for finding interactions. Biological 
filtering includes targeting specific pathways, integrating various databases based on
known biological and metabolic pathways, gene function ontology and protein-protein 
interactions. It is increasingly possible to target single-nucleotide polymorphisms that 
have defined functions on gene expression, though not belonging to protein-coding genes. 
Filtering can improve the power of an interaction association study, but also increases the 
chance of missing important findings. 
INTRODUCTION

Genome-wide association studies (GWAS) and next generation 
sequencing association studies based on single marker tests can 
identify many associated genetic variants, but typically explain 
only a small portion of the total estimated heritability. Gene-gene 
interactions may play an important role in the genetic etiology 
underlying complex phenotypes and statistical analyses that con- 
sider interaction may increase the power to detect epistatic genetic 
associations, especially among low-marginal-effect markers. 

Bateson (1909) defined epistasis as distortions from Mendelian 
segregation ratios due to one gene masking the effects of another. 
Fisher (1918) introduced the term "epistacy," considering it to be 
any departure from a linear model in which the phenotypic effects 
of genotypes at two or more loci are assumed to be additive. Ever 
since, the terms "epistasis" and "gene-gene interaction" have often 
been used interchangeably and we make no distinction between 
these two terms here. However, the purpose of including such 
terms in any genetic model must be considered. If, for example, 
we know that segregation at each of two loci affects a particular 
phenotype, whether quantitative or binary, we already know there 
must be biological interaction. So, unless our purpose is to describe 
that interaction, no further analysis is necessary to detect its pres- 
ence. In the case of a quantitative trait, whether or not there are 
interactions can depend on the scale of measurement, so the scale 
of the outcome is relevant. Factors that are additive with respect
to the outcome measured on one scale may not be additive on
another (Elston, 1961; Frankel and Schork, 1996; Greenland et al.,
1998; Wang et al., 2010; Steen, 2012). Similarly, in the analysis
of a binary trait, the link function used in a generalized linear 
model may determine whether or not interaction terms are neces- 
sary (Satagopan and Elston, 2012). If no transformation or change 
in link function can remove the interaction, it is called essential; 
in that case the best way to describe the interaction depends on 
how much of it is removable by a transformation or change of 
link function, and how much is essential. Simply describing the 
interaction by an appropriate statistical model may be useful for 
prediction in the same population as that sampled, but a predic- 
tion model may not be generalizable to other populations unless 
it is based on biological function. 
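
The dependence of interaction on the scale of measurement can be made concrete with a small numerical sketch. The genotype-class means below are hypothetical values chosen only for illustration, and the function name `interaction_contrast` is our own; the sketch simply shows a removable interaction, one that is present on the raw scale but vanishes after a log transformation.

```python
import math

def interaction_contrast(means):
    """Departure from additivity in a 2 x 2 table of genotype-class means.

    means[i][j] is the mean phenotype for class i at locus A and class j at
    locus B. The contrast is zero when the two loci act additively.
    """
    return means[1][1] - means[1][0] - means[0][1] + means[0][0]

# Hypothetical means: each risk genotype doubles the phenotype.
raw = [[1.0, 2.0],
       [2.0, 4.0]]
logged = [[math.log(m) for m in row] for row in raw]

raw_contrast = interaction_contrast(raw)      # non-zero on the raw scale
log_contrast = interaction_contrast(logged)   # vanishes on the log scale
print(raw_contrast, round(log_contrast, 12))
```

Because the two loci act multiplicatively in this toy table, the interaction is non-essential in the sense used above: a change of scale removes it.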

Detecting pair-wise or higher-order statistical interactions can 
require enormous computational time. In a genome-wide analysis, 
the increased computational cost makes it impractical to examine 
whether interactions are non-essential or can be better described 
by removing non-additivity. Advances in computational methods, 
such as using a GPU framework (Yung et al., 2011; Zhu et al., 2013)
and parallel computing strategies may overcome this limitation. 
However, the multiple hypothesis testing issue needs to be consid- 
ered: this is the major reason why most existing epistasis studies are 
limited to searching for pair-wise interactions among a moderate 
number of genetic markers. 
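
The size of the multiple testing burden can be sketched as follows. The marker count is a hypothetical figure, and the Bonferroni correction is used here only as a familiar, conservative illustration; it is an assumption of this sketch rather than a procedure prescribed above.

```python
import math

def n_interaction_tests(n_snps, order=2):
    """Number of distinct SNP combinations of the given order."""
    return math.comb(n_snps, order)

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error at alpha."""
    return alpha / n_tests

n_snps = 500_000  # hypothetical marker count for a GWAS array
pairs = n_interaction_tests(n_snps, 2)
triples = n_interaction_tests(n_snps, 3)
print(pairs, bonferroni_threshold(pairs))
print(triples)
```

Moving from pairs to triples multiplies the number of tests by roughly the number of markers divided by three, which is why exhaustive higher-order scans are rarely attempted.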
STATISTICAL METHODS FOR DETECTING STATISTICAL INTERACTIONS

Regression-based approaches are mostly used to model and
test interactions. The regression approach has been implemented
in the epistasis module of PLINK (Purcell et al., 2007)
to test pair-wise diallelic by diallelic epistasis for both quantitative
and binary traits. An extension of the PLINK epistasis
module, FastEpistasis, uses an efficient parallel computation algorithm
to test pair-wise interactions. FastEpistasis is 15 times
faster than PLINK using a single core computer (Schüpbach
et al., 2010). Marchini et al. (2005) proposed an approach for
joint association analyses allowing for pair-wise interactions
based on logistic models; their approach uses an exhaustive
search among single-nucleotide polymorphisms (SNPs) meeting
some low marginal significance threshold. The software
package PLATO can perform linear or logistic regression interaction
analysis, calculating the full model, the reduced model,
and the likelihood ratio test comparing the two (Grady et al.,
2010). 
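
As a minimal sketch of a regression-type interaction test, consider the simplest case of a binary trait and two SNPs each coded as carrier versus non-carrier (a coding assumed here for brevity). In a saturated logistic model with these two predictors, the interaction coefficient equals the log of the ratio of odds ratios and can be computed directly from the cell counts. This is not the implementation used by PLINK, FastEpistasis, or PLATO; the function name and the counts are hypothetical.

```python
import math

def logit_interaction(cases, controls):
    """Interaction log odds ratio and Wald z from a 2 x 2 x 2 table.

    cases[i][j] and controls[i][j] count individuals with carrier status i at
    SNP A and j at SNP B (0 = non-carrier, 1 = carrier). In a saturated
    logistic model with these two binary predictors, the interaction
    coefficient equals the log of OR(1,1) / (OR(1,0) * OR(0,1)), where each
    odds ratio is taken against the (0,0) cell.
    """
    def log_odds(i, j):
        return math.log(cases[i][j] / controls[i][j])

    beta = log_odds(1, 1) - log_odds(1, 0) - log_odds(0, 1) + log_odds(0, 0)
    # Wald standard error: square root of the summed reciprocal cell counts.
    se = math.sqrt(sum(1.0 / n for table in (cases, controls)
                       for row in table for n in row))
    return beta, beta / se

# Hypothetical counts: excess risk appears only in double carriers.
cases = [[100, 100], [100, 200]]
controls = [[100, 100], [100, 50]]
beta, z = logit_interaction(cases, controls)
print(round(beta, 3), round(z, 2))
```

In this toy table neither SNP changes the odds on its own, so a single-marker scan restricted to non-carriers of the other SNP would see nothing, while the interaction term is large.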

The advantages of regression-based approaches are the clear 
interpretation of the model and the parameters that relate geno- 
types to phenotype. However, regression-based approaches have 
many technical and computational disadvantages for testing 
higher-order interactions and require many more tests: the num- 
ber of parameters to be tested increases exponentially with the 
number of SNPs in the model. 
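
The exponential growth can be illustrated by counting the parameters of a fully saturated genotype model, assuming three genotypes per diallelic SNP and one parameter per joint genotype class beyond the baseline. The function name is our own.

```python
def saturated_model_terms(n_snps, genotypes_per_snp=3):
    """Non-intercept parameters in a fully saturated genotype model."""
    return genotypes_per_snp ** n_snps - 1

term_counts = {k: saturated_model_terms(k) for k in (1, 2, 3, 4, 5)}
for k, n_terms in term_counts.items():
    print(k, n_terms)
```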

Model-free approaches, such as machine learning and pat- 
tern recognition, afford an alternative strategy, and are capa- 
ble of detecting high-dimensional non-linear interactions. This 
approach generally does not estimate parameters. It finds com- 
binations of SNPs that can best separate cases and controls 
associated with the disease by epistatic interactions or joint effects. 
Some model-free approaches collapse high dimensional data into 
two dimensions, such as the combinatorial partitioning method 
(CPM; Nelson et al., 2001), restricted partition method (RPM;
Culverhouse et al., 2004), set association (Wille et al., 2003), and
multifactor dimensionality reduction (MDR; Ritchie et al., 2001,
2003; Hahn et al., 2003). 
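
The pooling step at the heart of MDR can be sketched as follows: each two-locus genotype cell is labeled high risk when its case:control ratio exceeds that of the whole sample, which collapses the genotype table into a single high/low-risk attribute. This is a simplified illustration under our own assumptions; it omits the cross-validation and model selection of the published method, and the function names and toy data are hypothetical.

```python
from collections import Counter

def mdr_high_risk_cells(genotypes_a, genotypes_b, is_case):
    """Label two-locus genotype cells as high risk, in the spirit of MDR.

    A cell is high risk when its case:control ratio exceeds the ratio in the
    whole sample. Returns the set of high-risk (a, b) genotype pairs.
    """
    case_counts, control_counts = Counter(), Counter()
    for a, b, case in zip(genotypes_a, genotypes_b, is_case):
        (case_counts if case else control_counts)[(a, b)] += 1
    n_cases = sum(case_counts.values())
    n_controls = sum(control_counts.values())
    high = set()
    for cell in set(case_counts) | set(control_counts):
        # Cross-multiplied comparison avoids dividing by an empty control cell.
        if case_counts[cell] * n_controls > control_counts[cell] * n_cases:
            high.add(cell)
    return high

def mdr_accuracy(genotypes_a, genotypes_b, is_case, high):
    """Fraction of individuals classified correctly by the high/low split."""
    hits = sum(((a, b) in high) == bool(case)
               for a, b, case in zip(genotypes_a, genotypes_b, is_case))
    return hits / len(is_case)

# Hypothetical toy data with an XOR-like pattern: cases carry a minor allele
# at exactly one of the two SNPs.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
status = [0, 1, 1, 0, 0, 1, 1, 0]
high = mdr_high_risk_cells(snp_a, snp_b, status)
accuracy = mdr_accuracy(snp_a, snp_b, status, high)
print(sorted(high), accuracy)
```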

Unsupervised pattern recognition has also been used to detect 
interactions. Li et al. (2011) proposed a method for family-based
studies to detect differentially inherited SNP modules by hierar- 
chically clustering SNPs that could be interactively associated with 
a disease. They first construct a genomic context-based SNP net- 
work based on adjacency on the chromosome. The association 
between each SNP and disease is evaluated on the basis of mutual 
information between SNP identity by descent sharing and affec- 
tion status sharing of pairs of siblings. Then they use a hierarchical 
clustering algorithm to find risk SNP modules (clusters) for which 
discriminative scores are locally maximal. In each module, the 
SNPs are within a certain network distance (defined as the num- 
ber of edges separating connected SNPs), and the discriminative 
score of a module is the maximum mutual information of the SNPs 
in the module, reflecting the risk associated with the module. 

A likelihood ratio-based Mann-Whitney approach (Lu et al.,
2012) and its extension (Wei et al., 2013) are other non-parametric
methods for detecting interaction. They use a multi-locus
Mann-Whitney statistic to evaluate the joint association of a
SNP combination. Using a computationally efficient forward
selection algorithm makes these methods feasible for genome-wide
gene-gene interaction analyses. Nevertheless, they require
at least one SNP in the combination to have a significant 
marginal association. The non-parametric approaches do not 
suffer from the issue of an increasing number of parame- 
ters when modeling high-order interactions, but it is difficult 
to determine how the detected SNP combinations affect the 
disease, either via the single marker associations or via their 
interactions. 

Some studies test marker-marker interactions by testing linkage
disequilibrium (LD) in the diseased population (Zhao et al.,
2006), or test the contrast of LD or Pearson correlation in cases
and controls (Kam-Thong et al., 2010; Prabhu and Pe'er, 2012).
These methods are based on the idea that, if two unlinked markers 
are interactively associated with a disease, the two markers will 
have LD patterns in the disease population. If controls are not 
studied, these methods assume that the controls do not exhibit 
similar patterns. 
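
The idea of contrasting correlation between cases and controls can be sketched as below. The comparison uses Fisher's z-transformation, which is our choice for illustration and not necessarily the statistic used in the cited methods; the helper names and the counts are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_contrast_z(case_a, case_b, ctrl_a, ctrl_b):
    """z statistic for a difference in SNP-SNP correlation, cases vs controls.

    Uses Fisher's z-transformation; each group needs more than 3 individuals
    and a correlation strictly between -1 and 1.
    """
    z_case = math.atanh(pearson(case_a, case_b))
    z_ctrl = math.atanh(pearson(ctrl_a, ctrl_b))
    se = math.sqrt(1.0 / (len(case_a) - 3) + 1.0 / (len(ctrl_a) - 3))
    return (z_case - z_ctrl) / se

def expand(table):
    """Turn {(a, b): count} into two parallel genotype lists."""
    a, b = [], []
    for (ga, gb), n in table.items():
        a += [ga] * n
        b += [gb] * n
    return a, b

# Hypothetical counts of (SNP A, SNP B) carrier status in each group.
case_a, case_b = expand({(0, 0): 15, (1, 1): 15, (0, 1): 5, (1, 0): 5})
ctrl_a, ctrl_b = expand({(0, 0): 10, (1, 1): 10, (0, 1): 10, (1, 0): 10})
z = correlation_contrast_z(case_a, case_b, ctrl_a, ctrl_b)
print(round(pearson(case_a, case_b), 3), round(z, 2))
```

Here the two unlinked markers are correlated among cases but independent among controls, which is the pattern these methods look for.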

FILTERING PIPELINES FOR EPISTATIC INTERACTIONS PRIOR TO 
ANALYSIS 

In GWAS, an exhaustive search among millions of SNPs for 
higher-order statistical interactions, or even just pair-wise inter- 
actions, could be computationally and statistically challenging. 
Filtering pipelines limit the number of tests performed between 
selected SNPs, whereas the use of computational technology 
and optimal algorithms increases the computational speed, and 
accelerates convergence if maximization is involved. While data 
driven filtering such as statistical filtering cleans the data to avoid 
the analysis of bad and least informative data, other types of 
filtering can be used purely to improve the power of interac- 
tion association analyses. In particular, filtering using biological 
knowledge limits the analysis to find the biologically most likely 
interactions. 

Knowledge-driven filtering 

Interaction models that are constructed based on specific bio- 
logical knowledge are more likely to make sense. Research 
over the last several decades has accumulated vast amounts 
of biological information that is stored in public databases. 
These include gene ontology annotation, gene-gene interac- 
tion databases, pathways, disease related gene networks and 
systems, as shown in Table 1. This information can greatly 
assist GWAS to find epistatic interactions. Many recent studies 
have used such biological knowledge and databases for filtering 
in their interaction studies. The databases have helped iden- 
tify biological pair-wise interactions among SNPs in pathways, 
and hence new associations and potential drug targets. For 
example, Liu et al. (2012) generated genome-wide SNP pairs
based on multiple biological pathways such as KEGG, STRING, 
T2DGADB, etc. 

Biofilter is an analysis pipeline that catalogs biological information
by integrating data from Reactome, KEGG, GO,
DIP, Pfam, Ensembl, and NetPath (Bush et al., 2009; Pendergrass
et al., 2013b). It can build SNP-SNP models based on
known interactions between genes and proteins in curated pathways
and networks. Grady et al. (2011) utilized the Biofilter
software to look for epistasis contributing to the risk of virologic
failure. Approximately two million SNP-SNP interaction
models were produced by Biofilter, and Grady et al. (2010)
tested these models by using logistic regression via the software
package PLATO. They identified interactions between SNPs in
the TAP1 and ABCC9 genes. Pendergrass et al. (2013a) identified
five significant GxG interactions associated with cataract
using Biofilter. Bush et al. (2011) studied multiple sclerosis
susceptibility with Biofilter, identifying gene-gene interactions of
susceptibility loci involved in the central nervous system and
neuron function. Turner et al. (2011) used Biofilter to detect
associations with low density lipoprotein cholesterol level,
identifying 11 significant GxG interactions, eight of which were
replicated in a second cohort. In each of these examples,
Biofilter generated biologically plausible gene-gene and SNP-SNP
interaction models that were replicated in an independent
study.

Table 1 | Biological information databases on gene ontology annotation, gene-gene interactions, pathways, disease related gene networks and systems.

| Database | URL | Description | Reference |
|---|---|---|---|
| KEGG | http://www.genome.jp/kegg/pathway.html | KEGG is a collection of manually drawn pathway maps representing knowledge on the molecular interaction and reaction networks for metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development. | Kanehisa and Goto (2000) |
| GO | http://www.geneontology.org/ | GO provides an ontology of defined terms representing gene product properties. The ontology covers three domains: cellular component, molecular function, and biological processes. | Ashburner et al. (2000) |
| DIP | http://dip.doe-mbi.ucla.edu/dip/ | Database of experimentally determined interactions between proteins. | Xenarios et al. (2000) |
| BioGRID | http://thebiogrid.org/ | A comprehensive resource of protein-protein and genetic interactions for all major model organism species. | Stark et al. (2006) |
| NetPath | http://www.netpath.org/ | Resource of signal transduction pathways in humans. | Kandasamy et al. (2010) |
| IntAct | http://www.ebi.ac.uk/intact/ | Database of molecular interactions that are derived from literature curation or direct user submissions. | Orchard et al. (2014) |
| MINT | http://mint.bio.uniroma2.it/mint/ | MINT focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators. MINT now uses the IntAct database infrastructure to limit the duplication of efforts and to optimize future software development. | Chatr-aryamontri et al. (2007) |
| MIPS | http://mips.helmholtz-muenchen.de/proj/yeast/CYGD/interaction/ | The MIPS mammalian protein-protein interaction database is a collection of manually curated high-quality interactions. | Pagel et al. (2005) |
| Pfam | http://pfam.sanger.ac.uk/ | The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models. There are two kinds of entries in Pfam: Pfam-A entries are high quality, manually curated families; Pfam-B entries have lower quality. | Punta et al. (2012) |
| STRING | http://string-db.org | A database of known and predicted protein interactions, including direct (physical) and indirect (functional) associations. | Szklarczyk et al. (2011) |
| MSigDB | http://www.broadinstitute.org/gsea/msigdb/ | Molecular signatures database, a collection of annotated gene sets integrating canonical pathways representing biological processes. | Subramanian et al. (2005) |
| BioCarta | http://www.biocarta.com/genes/ | Includes classical pathways as well as current suggestions for new pathways. | Nishimura (2001) |
| Reactome | http://www.reactome.org/PathwayBrowser/ | The Reactome pathway database aims to provide intuitive bioinformatics tools for visualization, interpretation and analysis of pathway knowledge. | Croft et al. (2011) |
| T2DGADB | http://t2db.khu.ac.kr:8080/ | A disease gene network database for type 2 diabetes. | Lim et al. (2010) |

Some studies reduce the number of tests by performing a gene- 
based, as opposed to a SNP-based, interaction test. Baranzini 
et al. (2009) combined the SNP-wise P-values to form a gene-wise
P-value for each gene (such as using the minimum P-value
for the gene), and superimposed the gene-wise P-values on
a human protein interaction network to identify sub-networks
containing a higher proportion of genes associated with multiple
sclerosis than expected by chance. Ma et al. (2013) tested
interactions of SNP pairs that are separately located in two 
different genes as marker-based tests. To test the interaction 
between each pair of genes, they combined these marker-based 
interactions and the LD between markers into a gene-based 
statistic. 
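
A gene-wise summary of SNP-level evidence, followed by a knowledge-driven restriction to known interacting gene pairs, can be sketched as below. The minimum P-value rule is the example mentioned above; the second step is a deliberately simplified stand-in for network-based analysis, and all SNP identifiers, gene names, P-values, and the threshold are hypothetical.

```python
def gene_wise_p_values(snp_p_values, snp_to_gene):
    """Summarise SNP-level P-values per gene by taking the minimum."""
    gene_p = {}
    for snp, p in snp_p_values.items():
        gene = snp_to_gene.get(snp)
        if gene is None:
            continue  # SNP not mapped to any gene
        gene_p[gene] = min(p, gene_p.get(gene, 1.0))
    return gene_p

def candidate_gene_pairs(gene_p, interactions, threshold):
    """Keep known interacting gene pairs in which both genes pass threshold."""
    return [(g1, g2) for g1, g2 in interactions
            if gene_p.get(g1, 1.0) <= threshold
            and gene_p.get(g2, 1.0) <= threshold]

# Hypothetical inputs.
snp_p = {"rs1": 0.04, "rs2": 0.001, "rs3": 0.2, "rs4": 0.03, "rs5": 0.5}
snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A",
               "rs3": "GENE_B", "rs4": "GENE_C"}
known_interactions = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]

gene_p = gene_wise_p_values(snp_p, snp_to_gene)
candidates = candidate_gene_pairs(gene_p, known_interactions, threshold=0.05)
print(gene_p, candidates)
```

Note that the minimum P-value is not adjusted here for the number of SNPs per gene; a real analysis would need to account for gene size.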

Knowledge-driven filtering approaches can test models of 
genes that participate in the same biological pathway or net- 
work, and the interpretation of the interactions is then more 
straightforward. But their precision and power are hard to val- 
idate by simulation. Because such approaches depend on prior 
knowledge, which may not be accurate or may not be appli- 
cable to a particular dataset, they may miss what could be 
important findings among the genes for which we have little 
knowledge. 

Data-driven filtering 

Filtering based on statistical tests is data-driven. Statistical 
data-driven filtering includes, apart from SNP quality control, 
single marker associations, feature selection to keep only the 
most informative markers, and statistical tests to screen for 
potential interactions. Using data-driven filtering in GWAS can 
dramatically decrease the search space used to find interac- 
tions, so that subsequent statistical tests and machine learn- 
ing methods can be applied as an exhaustive search among 
a smaller number of SNPs. The performance of data-driven 
filtering depends on the assumptions that the statistical tests 
or filtering algorithms make. Single marker association fil- 
tering can only screen interactions among SNPs showing at 
least a moderate effect on the trait of interest, while feature 
selection filtering and variance heterogeneity filtering can be 
used to detect SNP interactions with very weak marginal SNP 
effects. 



Filtering according to single marker association. Filtering SNPs 
based on their marginal effects is frequently used for a high- 
dimensional gene-gene interaction search. It is often combined 
with biological filtering to identify interactions among SNPs 
that are marginally associated with a phenotype (Baranzini et al., 
2009; Grady et al., 2011; Turner et al., 2011; Ma et al., 2012;
Pendergrass et al., 2013a). This approach follows the principles
of hierarchical model building in the general linear model,
where the interaction terms are tested only after all main-effect
terms are deemed statistically significant. Typically the significance
threshold used is less stringent than the usual genome-wide
threshold of 5 x 10^-8. The advantage of this filtering
is that it is easy to implement; its disadvantage is that it has 
low power for detecting interactions among low-marginal-effect 
SNPs. 
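
Single-marker filtering followed by an exhaustive pair-wise scan of the survivors can be sketched as follows. The P-values and the threshold of 1e-4 are hypothetical, and the function name is our own.

```python
from itertools import combinations

def marginal_filter(marginal_p, threshold=1e-4):
    """Return SNP identifiers whose single-marker P-value passes threshold."""
    return sorted(snp for snp, p in marginal_p.items() if p <= threshold)

# Hypothetical single-marker P-values.
marginal_p = {"rs1": 2e-6, "rs2": 0.3, "rs3": 5e-5, "rs4": 8e-5, "rs5": 0.02}
kept = marginal_filter(marginal_p)
pairs_to_test = list(combinations(kept, 2))
all_pairs = list(combinations(sorted(marginal_p), 2))
print(kept, len(pairs_to_test), len(all_pairs))
```

The pair (rs2, rs5) is never tested in this toy example, which is exactly the loss of power for low-marginal-effect SNPs noted above.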

Filtering by feature selection algorithms. Feature selection 
algorithms such as Relief (Kira and Rendell, 1992), ReliefF 
(Kononenko, 1994), Tuned ReliefF (TuRF; Moore and White, 
2007), and Spatially Uniform ReliefF (SURF; Greene et al., 2009)
can also be used. They screen pairs of diallelic SNPs that can clus- 
ter individuals with similar phenotypes, on the basis of the nine 
two-SNP genotypes, into two distinct classes (e.g., cases versus 
controls). For each individual only a small subset of neighboring 
individuals, i.e., individuals most similar to that individual over 
all the SNPs, is examined. Iterating over each individual and its 
chosen subset of neighboring individuals, SNPs are up-weighted 
for selection on the basis of belonging to the SNP pairs most 
frequently found in all such sets. Simulation results have indi- 
cated this is able to identify SNP pairs with purely non-additive 
effects in genome-wide datasets. Evaporative cooling (McKinney 
et al., 2007) is another feature selection approach which couples
mutual information and thermodynamics theory. It filters
SNPs by removing those with least information for epistatic inter- 
actions. Such feature selection filtering is able to retain pure 
epistatic (i.e., essential) interaction between markers with low- 
marginal effects, offering a powerful alternative to single-marker 
filtering. 
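
The nearest-neighbor weighting underlying this family of algorithms can be sketched with the basic Relief update for a binary phenotype. This is a simplified illustration under our own assumptions (every individual is visited once rather than sampled, a single nearest hit and miss are used, and distance is the number of differing genotypes); it is not ReliefF, TuRF, or SURF, and the toy data are hypothetical.

```python
def relief_weights(genotypes, labels):
    """Basic Relief feature weights for a binary phenotype.

    genotypes: list of per-individual genotype lists (one entry per SNP).
    labels: list of 0/1 phenotype labels.
    Every individual is visited once instead of being sampled at random.
    """
    n, n_snps = len(genotypes), len(genotypes[0])
    weights = [0.0] * n_snps

    def distance(i, j):
        # Number of SNPs at which two individuals carry different genotypes.
        return sum(a != b for a, b in zip(genotypes[i], genotypes[j]))

    for i in range(n):
        others = [j for j in range(n) if j != i]
        hits = [j for j in others if labels[j] == labels[i]]
        misses = [j for j in others if labels[j] != labels[i]]
        nearest_hit = min(hits, key=lambda j: distance(i, j))
        nearest_miss = min(misses, key=lambda j: distance(i, j))
        for s in range(n_snps):
            # Penalise SNPs that differ between same-class neighbours and
            # reward SNPs that differ between opposite-class neighbours.
            weights[s] -= (genotypes[i][s] != genotypes[nearest_hit][s]) / n
            weights[s] += (genotypes[i][s] != genotypes[nearest_miss][s]) / n
    return weights

# Hypothetical toy data: the phenotype is the XOR of SNPs 0 and 1, so neither
# has a marginal effect; SNP 2 is unrelated noise.
data = [[a, b, noise] for noise in (0, 1) for a in (0, 1) for b in (0, 1)]
labels = [a ^ b for a, b, _ in data]
weights = relief_weights(data, labels)
print(weights)
```

The two interacting SNPs receive higher weights than the noise SNP even though each is uninformative on its own, which is the property that makes such filters attractive for purely epistatic effects.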

Filtering by testing variance heterogeneity of phenotype among 
SNP genotypes. For a quantitative trait, the presence of gene-gene 
interactions will result in heterogeneity of the phenotype vari- 
ances among the genotypes of a single SNP, and this heterogeneity 
of phenotype variance has been proposed as a screen to priori- 
tize SNPs for interaction testing (Pare et al., 2010; Struchalin et al., 
2010). SNPs selected on the basis of variance heterogeneity would 
then be used for later gene-gene or gene-environment interac- 
tion analyses. However, unless the phenotypic means are the same 
for all the SNP genotypes, a transformation corresponding to a 
non-linear change in the scale of measurement may equalize the 
variances (Sun et al., 2013). This transformation, if it can be found, 
would eliminate any interactions detected this way. 
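
A variance heterogeneity screen for one SNP can be sketched with the Brown-Forsythe (median-centered Levene) statistic. The choice of this particular statistic is an assumption of the sketch, and the phenotype values are hypothetical; under the null hypothesis of equal variances the statistic is referred approximately to an F distribution with k - 1 and N - k degrees of freedom, a step not computed here.

```python
from statistics import median

def brown_forsythe_w(phenotypes, genotypes):
    """Brown-Forsythe (median-centred Levene) statistic across genotypes."""
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    # Absolute deviation of each phenotype from its genotype-group median.
    deviations = {}
    for g, ys in groups.items():
        centre = median(ys)
        deviations[g] = [abs(y - centre) for y in ys]
    n_total, k = len(phenotypes), len(groups)
    group_mean = {g: sum(z) / len(z) for g, z in deviations.items()}
    grand_mean = sum(sum(z) for z in deviations.values()) / n_total
    between = sum(len(z) * (group_mean[g] - grand_mean) ** 2
                  for g, z in deviations.items())
    within = sum((v - group_mean[g]) ** 2
                 for g, z in deviations.items() for v in z)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical phenotypes for three genotype groups of three individuals.
genotypes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
spread_differs = [9, 10, 11, 8, 10, 12, 4, 10, 16]    # same mean, unequal spread
mean_differs = [9, 10, 11, 19, 20, 21, 29, 30, 31]    # unequal mean, same spread
w_spread = brown_forsythe_w(spread_differs, genotypes)
w_mean = brown_forsythe_w(mean_differs, genotypes)
print(round(w_spread, 3), w_mean)
```

The statistic responds to differences in spread and ignores a pure shift in means, so a SNP with only a main effect is not flagged by this screen.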

USING OPTIMAL SEARCH ALGORITHMS AND COMPUTATIONAL 
TECHNOLOGY TO SPEED A SCAN FOR INTERACTIONS 

Exhaustive search of interactions among millions of SNPs in 
GWAS data is computationally time-consuming. However, heuris- 
tic stochastic searching algorithms and efficient computational 
technology, such as parallel computing and bit operation, can
boost the computational speed and, if maximization is involved, 
speed the convergence required to calculate test statistics. Some 
interaction studies use optimal searching and computational tech- 
nology to search the whole space for potential interactions. An 
ultrafast genome-wide scan approach for SNP-SNP interactions, 
SIXPAC, employs a randomization searching algorithm - probably
approximately complete (PAC) testing - to drastically trim
the universe of SNP combinations. The approach samples small 
groups of cases and highlights combinations of alleles carried by 
all individuals in the group. By further incorporating bit operation 
technology, SIXPAC can scan genome-wide pair-wise interactions 
in a few hours, compared to PLINK in weeks (Prabhu and Pe'er, 
2012). 
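
The gain from bit operations can be illustrated with a minimal sketch in which each SNP's carrier status across individuals is packed into a single integer, so that joint carriers are counted with one AND and a population count. This is not SIXPAC's actual implementation; the function names and the carrier flags are hypothetical.

```python
def carrier_mask(carrier_flags):
    """Pack per-individual 0/1 carrier flags into one Python integer."""
    mask = 0
    for position, flag in enumerate(carrier_flags):
        if flag:
            mask |= 1 << position
    return mask

def joint_carrier_count(mask_a, mask_b):
    """Count individuals carrying both alleles with one AND and a popcount."""
    return bin(mask_a & mask_b).count("1")

# Hypothetical carrier status of eight individuals at two SNPs.
snp_a = [1, 0, 1, 1, 0, 1, 0, 1]
snp_b = [1, 1, 0, 1, 0, 1, 0, 0]
mask_a, mask_b = carrier_mask(snp_a), carrier_mask(snp_b)
both = joint_carrier_count(mask_a, mask_b)
print(both)
```

Each mask is built once per SNP and reused for every pair, so the per-pair cost reduces to a handful of word-level operations.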

Lu et al. (2012) developed a likelihood ratio-based Mann-Whitney
approach that can test high-order interactions. It is
computationally efficient and conducts only one test for all the
identified interactions, so that no adjustment is necessary for multiple
testing. A further extension of the approach introduces a
randomizing algorithm into the scan, using ensemble tree models
(Wei et al., 2013), to increase the computational efficiency and
convergence precision. 

Schüpbach et al. (2010) developed an efficient extension of
the PLINK epistasis module by using a parallel computing algo- 
rithm running on multiple processors to increase the speed of an 
exhaustive scan of all SNP pairs. 

Heuristic or randomized search is much more efficient than 
exhaustive search, so it can perform a genome-wide scan of inter- 
actions among millions of SNPs without any filtering in reasonable 
time. However, it cannot guarantee reaching the optimal solu- 
tion, which means it may not find all the biologically relevant 
interactions. 

CONCLUSION 

Numerous approaches have been proposed for the analysis of 
epistatic interactions, each of which has advantages and disadvantages.
Regression models are easy to interpret,
but they are less suitable for modeling high-order interactions
among a large number of markers. Model-free approaches do
not give an explicit explanation of interaction findings, but 
they are good at detecting high dimensional non-linear inter- 
actions. Tests for interactions by contrasting LD between cases 
and controls or by studying phenotype variance heterogene- 
ity among the different genotypes of a SNP, are two spe- 
cial tests for detecting epistasis in the absence of any main- 
effect. 

With the emergence of massive amounts of genome sequencing
data, developing efficient searching algorithms and filter
pipelines is especially important. Heuristic searching is much
faster than exhaustive searching, at the cost of missing some true 
positive results and finding more false positive results. Filter- 
ing pipelines based on biological knowledge have the advantage 
of providing a clearer biological explanation for the detected 
interactions, but the assumed knowledge may be limited and 
not error-free, in which case such filtering may also lead to 
testing some irrelevant interaction models and may miss novel 
and important signals. Data-driven filtering cleans the data by
removing low quality and the least informative SNPs, but its
performance depends on the underlying assumptions of the fil- 
ter. Because statistical and biological filtering each has unique 
features, they should be viewed as complementary to, rather 
than as competing with, each other. Through novel approaches 
for filtering and modeling GxG interactions, we may iden- 
tify more of the missing heritability for common, complex 
traits. 